AD/A-005 692

THE OPTIMAL SELECTION OF SECONDARY INDICES FOR FILES

Mario Schkolnick

Carnegie-Mellon University

Prepared for:

Air Force Office of Scientific Research
Defense Advanced Research Projects Agency

November 1974

DISTRIBUTED BY:

National Technical Information Service
U. S. DEPARTMENT OF COMMERCE

UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE (READ INSTRUCTIONS BEFORE COMPLETING)

1. Report number: -TP - 75 - 0196
2. Govt accession no.:
4. Title (and subtitle): THE OPTIMAL SELECTION OF SECONDARY INDICES FOR FILES
5. Type of report & period covered: Interim
6. Performing org. report number:
7. Author(s): Mario Schkolnick
8. Contract or grant number(s): F44620-73-C-0074
9. Performing organization name and address: Carnegie-Mellon University, Computer Science Dept., Pittsburgh, PA 15213
10. Program element, project, task area & work unit numbers: 61101D, AO-2466
11. Controlling office name and address: Defense Advanced Research Projects Agency, 1400 Wilson Blvd., Arlington, VA 22209
12. Report date: November 1974
13. Number of pages: 16
14. Monitoring agency name & address (if different from Controlling Office): Air Force Office of Scientific Research (NM), 1400 Wilson Blvd., Arlington, VA 22209
15. Security class (of this report): UNCLASSIFIED
15a. Declassification/downgrading schedule:
16. Distribution statement (of this Report): Approved for Public Release; Distribution Unlimited
17. Distribution statement (of the abstract entered in Block 20, if different from Report):
    Reproduced by NATIONAL TECHNICAL INFORMATION SERVICE, US Department of Commerce, Springfield, VA. 22151
18. Supplementary notes: PRICES SUBJECT TO CHANGE
19. Key words (Continue on reverse side if necessary and identify by block number):
20. Abstract (Continue on reverse side if necessary and identify by block number):

We consider the problem of finding an optimal set of indices for a file. A
general model for a file is assumed together with a probabilistic model of
the transactions conducted with it: Queries, Updates, Insertions and Deletions.
It is shown that all the information assumed for each attribute can be
condensed into two parameters and that properties of the optimal solution can be
derived from this condensed information. An algorithm to find the optimal set
of indices based on these properties is exhibited.

DD FORM 1473, 1 JAN 73. EDITION OF 1 NOV 65 IS OBSOLETE

UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

THE OPTIMAL SELECTION OF SECONDARY INDICES FOR FILES

Mario Schkolnick
Carnegie-Mellon University
Pittsburgh, Pa. 15213

November 1974

Approved for public release;
distribution unlimited.

ABSTRACT 

We consider the problem of finding an optimal set of indices for a file. A general
model for a file is assumed together with a probabilistic model of the transactions conducted
with it: Queries, Updates, Insertions and Deletions. It is shown that all the information
assumed for each attribute can be condensed into two parameters and that properties of the
optimal solution can be derived from this condensed information. An algorithm to find the
optimal set of indices based on these properties is exhibited.

INTRODUCTION

We consider here the problem of selecting a set of attributes for which a secondary
index should be provided in order to minimize the expected cost per transaction conducted
with a file.

A proper solution to this problem has to consider the system in which the file will
exist as well as characteristics such as the accessing mechanisms to it and the statistical
properties of the transactions that are conducted with the file. Some systems which have been
implemented use the simple approach of providing indices for all attributes [2] while others
do not provide indices at all [6]. For these systems the optimization problem does not exist.
Between these two extreme approaches there are systems like XRM [8] which allow the
user to specify which domains of the relation should be indexed. Some high level languages
have been designed [4] for which the system, while in the process of performing a
transaction, creates temporary indices for some attributes. After the transaction is
completed, the user is informed of the newly created indices and may then choose to keep
them or delete them. Systems such as these, and others for which on-line transactions may
be conducted with a file by users whose demands on the file change in time, are best suited
for solutions as described here. After collecting statistical properties of the transactions that
the current set of users conducts with the file, the system may compute an optimal set of
indices to minimize transaction processing time. Then it creates its optimal index set,
throwing away those indices which increase transaction time and creating
indices which decrease it. Since this overhead processing may be
appreciable, the system will probably perform this updating only at some fixed intervals of
time.

Recently, a number of studies have appeared in the literature which consider this
problem. Some of them have taken an empirical approach [9] while others [7], [10], [14]
attempt to formalize the problem to obtain an analytic solution. Since the data base usually
exists in a complex environment there are innumerable factors that will influence the index
selection problem and, in order to hope for a solution, some assumptions must be made. In
[7] the restriction was made that transactions could only specify one attribute value in their
specification part, while in [14] the transactions were reduced to queries and updates only
and the statistical properties considered were minimal. We will now present a model which
encompasses a variety of situations by allowing great flexibility in the specification of the
statistical properties of the transactions as well as of the storage organization and retrieval
mechanism for the indices. The inclusion or exclusion of an attribute in an optimal set of
indices will depend on two parameters which are derived for each attribute when the
specific properties mentioned above are known to the system. By studying properties of
the optimal solution we are able to describe an efficient algorithm which makes use of these
pairs of parameters to determine the optimal solution.

SECTION 1

In this section we present the model of the file, the assumptions on storage
organization and retrieval, and the types of transactions conducted with it.

We will assume a relational model for a data base [5] and we will consider the
problem of index selection when there is a single relation in the data base (the results
shown here can be directly extended to a multi-relational data base provided we assume
independence between the relations).

The file F will consist of a set of N vectors (or records) $v = (v_1, v_2, \ldots, v_m)$ where
each $v_j \in A_j$, the j-th attribute. Thus $F \subseteq A_1 \times A_2 \times \cdots \times A_m$. We also assume the
existence of atoms for the domains [15]. Each attribute will then be a finite set whose
elements can appear in a transaction.

We will assume that the use of secondary indices is the only mechanism that will be
used to facilitate the search for records in the file. Thus, we will consider here that the file
is randomly stored in secondary memory and is not clustered according to any criteria. The
approach of clustering records has been studied in the literature [11] but cannot be taken
when we assume that the data base is used by several users with different requirements on
it. Thus, the time required to retrieve n records will be the time to bring in the pages which
contain them. Since the records are assumed to be randomly distributed, and since the selection
of an optimal set of indices will result in a small expected number of records to be accessed
when processing a transaction, this time will be $cn$, for some constant c which measures the
time required to bring a page to main memory.

Since the method for storing and retrieving indices is a fundamental characteristic of
a system using secondary indices for processing transactions we will not impose any
restriction on it. The only structure that we impose is that, if an index $I_j$ for the j-th
attribute exists, knowledge of a value $a \in A_j$ will produce a list of pointers to records (which
we will in the sequel refer to as tids, for tuple identifiers) in time $f(j,a)$ for some function f.
For example, if the indices have the usual two level hierarchical organization and are stored
on a disk, $f(j,a)$ could be taken to be $c_1 + c_2 \beta_j(a) N$ where $c_1$ is the constant time required to
find the head of the appropriate list of tids, $c_2$ is the transfer time per tid and $\beta_j(a)$ gives
the proportion of the total number N of records that have the value a as their j-th attribute.

Let N be the size of the file. Even though deletions and insertions will be allowed,
we assume that the size of the file remains fixed throughout the period between successive
computations of optimal sets of indices. For each attribute we will assume a distribution
which gives the proportion $\beta_j(a)$ of records in the file having a particular value $a \in A_j$ as
their j-th attribute. (One simplifying assumption that is sometimes made is to consider $\beta_j$ a
constant function.)

There will be four types of transactions conducted with the file: Queries, Updates,
Insertions and Deletions. It is frequently the case that an Update or a Deletion is specified
in two components: a selection component which determines a set of records to be
processed, and an action component which, in the case of an Update, determines how each
record in the set is to be updated, and in the case of a Deletion is null, simply meaning that
the set of records which was found is to be removed from the file. As will be discussed
below, finding the relevant records implies a number of actions. After they are done, we
can no longer assume that any part of an index is in main memory, so the updating of
the indices that results from an Update or a Deletion can be assumed to be independent of
the processing of the selection part of the corresponding transaction. For this reason we do
not include a selection part in the formal specification of an Update or Deletion.

A Query Q will be specified as $Q = (q_1, q_2, \ldots, q_m)$ where $q_j \in A_j \cup \{x\}$ and x is a symbol
not in any set $A_j$. If $q_i \in A_i$, we say that the i-th attribute is specified in the query. Note
that a query to our model can result from a true query in the system, or from an update or a
deletion as discussed above. An actual query is sometimes accompanied by an output part
which specifies that some attributes of the set of records found to satisfy the query are to
be displayed or further processed. These operations take time independent of the set of
attributes for which an index exists and thus can be safely ignored for our analysis.

We will assume, as is done in most systems, that a query is processed as follows:

1. For each specified attribute j for which an index exists, a list $L_j$ is found containing
all tids of records whose j-th attribute has the specified value.

2. From all lists $L_j$, an intersection list L is formed, $L = \bigcap_j L_j$. This list contains all tids of
records all of whose attributes for which an index exists and which are specified in
the query have the specified value. These records will be referred to as partially
qualified records.

3. All partially qualified records are brought to main memory where the values of those
specified attributes for which there is no index are checked. Those records which
are found not to satisfy the query are disregarded (these are sometimes referred to
as false drops) and a list of qualifying records (or their tids) is obtained which can
be further processed. As was mentioned before, any further operation is
independent of the chosen index set and will not be considered any further.
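
The three processing steps can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: records are tuples held in a list, a tid is the list position, an index is a dictionary from attribute value to a set of tids, and `None` plays the role of the symbol x for an unspecified attribute. The names `build_index` and `process_query` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Record = Tuple[str, ...]
Query = Tuple[Optional[str], ...]   # None stands for the "unspecified" symbol x


def build_index(file: List[Record], j: int) -> Dict[str, Set[int]]:
    """Index for attribute j: value -> set of tids (here, list positions)."""
    index: Dict[str, Set[int]] = {}
    for tid, record in enumerate(file):
        index.setdefault(record[j], set()).add(tid)
    return index


def process_query(file: List[Record],
                  indices: Dict[int, Dict[str, Set[int]]],
                  query: Query) -> Tuple[List[int], List[int]]:
    """Return (partially qualified tids, fully qualified tids)."""
    specified = [j for j, q in enumerate(query) if q is not None]

    # Step 1: one tid list per specified attribute that has an index.
    lists = [indices[j].get(query[j], set()) for j in specified if j in indices]

    # Step 2: intersect the lists; with no usable index every record is a candidate.
    partial = set.intersection(*lists) if lists else set(range(len(file)))

    # Step 3: fetch the candidates and check the specified attributes without an index.
    unindexed = [j for j in specified if j not in indices]
    qualified = [t for t in sorted(partial)
                 if all(file[t][j] == query[j] for j in unindexed)]
    return sorted(partial), qualified


file = [("red", "small"), ("red", "large"), ("blue", "small")]
indices = {0: build_index(file, 0)}          # only attribute 0 is indexed
partial, qualified = process_query(file, indices, ("red", "small"))
print(partial, qualified)                    # tid 1 is a false drop
```

In this toy run the index on attribute 0 narrows the candidates to two records, and the check in main memory discards the one whose second attribute does not match.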

Let $\alpha_j(a)$ be the probability that the j-th attribute is specified in a query to have the
value $a \in A_j$. For convenience we will denote as $p_j$ the probability that the j-th attribute is
specified in a query. Thus,

$$p_j = \sum_{a \in A_j} \alpha_j(a).$$

The expected cost to process step 1 above is given by

$$\sum_{j \in D} \sum_{a \in A_j} \alpha_j(a) f(j,a)$$

where D is the set of attributes for which an index exists. The cost of step 2 can
be considered negligible as compared to the cost of steps 1 and 3. The reason for this is that the
intersection list can be constructed in main memory and any processing time spent here is
small compared with the cost of interacting with secondary storage. It is not hard to see
that the expected number of tids in the resulting intersection list L is given by

$$|L| = N \prod_{j \in D} \Big[ 1 - p_j + \sum_{a \in A_j} \alpha_j(a) \beta_j(a) \Big].$$

(Each factor is the probability that a record is not ruled out by the j-th index: either the
attribute is unspecified, with probability $1 - p_j$, or it is specified as a and the record carries
that value.) The cost of step 3, as explained at the beginning of this section, is proportional
to the length of L: it is $c|L|$. The cost of removing the false drops to form the final list can be
considered, as in step 2, to be negligible since it is done in main memory. Thus the expected
cost to process a query is given by

$$C_Q = \sum_{j \in D} \sum_{a \in A_j} \alpha_j(a) f(j,a) + cN \prod_{j \in D} \Big[ 1 - p_j + \sum_{a \in A_j} \alpha_j(a) \beta_j(a) \Big].$$
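
As a rough numerical illustration of the query cost, the sketch below evaluates the two terms of $C_Q$ for a given index set. The data layout (dictionaries keyed by attribute number and value), the function name `expected_query_cost` and all numbers are assumptions made for this example; only the formula itself comes from the text, and `f` uses the two-level form $c_1 + c_2 \beta_j(a) N$ mentioned earlier.

```python
from math import prod
from typing import Callable, Dict, Iterable

Dist = Dict[str, float]   # value a -> probability or proportion


def expected_query_cost(D: Iterable[int],
                        alpha: Dict[int, Dist],
                        beta: Dict[int, Dist],
                        f: Callable[[int, str], float],
                        c: float,
                        N: int) -> float:
    """C_Q = sum_{j in D} sum_a alpha_j(a) f(j,a)
             + c N prod_{j in D} [1 - p_j + sum_a alpha_j(a) beta_j(a)]."""
    D = list(D)
    # Step 1: expected time to read the tid lists of the indexed, specified attributes.
    step1 = sum(alpha[j][a] * f(j, a) for j in D for a in alpha[j])

    # Step 3: c times the expected length of the intersection list L.
    def factor(j: int) -> float:
        p_j = sum(alpha[j].values())
        return 1 - p_j + sum(alpha[j][a] * beta[j].get(a, 0.0) for a in alpha[j])

    return step1 + c * N * prod(factor(j) for j in D)


alpha = {0: {"a": 0.25, "b": 0.25}}     # attribute 0 is specified half of the time
beta = {0: {"a": 0.5, "b": 0.5}}        # each value covers half of the file
c1, c2, N, c = 1.0, 0.02, 100, 1.0


def f(j: int, a: str) -> float:
    """Two-level index example: c1 + c2 * beta_j(a) * N."""
    return c1 + c2 * beta[j][a] * N


no_index = expected_query_cost([], alpha, beta, f, c, N)     # every record is fetched
with_index = expected_query_cost([0], alpha, beta, f, c, N)
print(no_index, with_index)
```

With these made-up numbers the index pays for itself: the lookup term adds about one time unit while the expected number of fetched records drops from 100 to about 75.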

An Update U will be specified as $U = (v; u_1, u_2, \ldots, u_m)$ where each $u_j \in A_j \cup \{x\}$ and
$v \in F$. The intended meaning for an update is that record v (which can be identified by its
tid) has to be updated on those specified attributes $A_i$ (i.e., those for which $u_i \ne x$) to have
the new value $u_i$.

To process an Update, one has to (1) retrieve the record v, update its specified
attributes, and store it back again and (2) update all relevant indices. The time required to
perform (1) is independent of the existing set of indices and thus we do not consider it.
Updating the indices requires reading and writing back the bucket(*) for the old value of the
attribute and doing the same for the new value. We will assume that both these operations
are performed even if these buckets coincide. (As it turns out, consideration of this fact
would only result in a slightly more complicated expression for the parameter K(i) associated
to the i-th attribute (see Section 2), but does not otherwise change the nature of the
algorithm to find an optimal set of indices.)

Let $f'(j,a)$ be the time required to read and write back the "a" bucket for the j-th
index. Then, the expected cost of an update on the j-th index is

$$\sum_{b \in A_j} \mu_j(b) \Big[ \sum_{a \in A_j} \beta_j(a) f'(j,a) + f'(j,b) \Big]$$

where the first term inside the square bracket is the cost of updating the old bucket
(assuming that the distribution of the number of tids on all buckets of the j-th index is given
by $\beta_j$) and the second term is the cost of updating the new bucket. Also we assume a
probability $\mu_j(b)$ that the value $b \in A_j$ for the j-th attribute will be specified in an Update.
Thus, if we have a set D of indices, the expected cost of an Update is given by

$$C_U = \sum_{j \in D} \sum_{a \in A_j} \big[ \nu_j \beta_j(a) + \mu_j(a) \big] f'(j,a)$$

where $\nu_j = \sum_{b \in A_j} \mu_j(b)$ is the probability that the j-th attribute is specified in an update.

(*) A bucket is the set of all tids of tuples having the same value on an indexed attribute.


Insertion: An insertion is specified as I(v). It requires insertion of the record
itself plus the updating of all indices. As was the case for updates, the first cost is
independent of the index set. Assuming a distribution of values given by $\beta_j$, the cost of the
second component is

$$C_I = \sum_{j \in D} \sum_{a \in A_j} \beta_j(a) f'(j,a).$$

Deletion: A deletion T(v) of the record v requires a similar set of operations as an
insertion and the expression for the resulting cost is the same, $C_D = C_I$.

Combining all expressions obtained above, we get that the expected cost per
transaction is given by $E(D) = r_Q C_Q + r_U C_U + r_I C_I + r_D C_D$ where $r_Q$, $r_U$, $r_I$ and $r_D$ are,
respectively, the probabilities that the transaction is a Query, an Update, an Insertion or a
Deletion. This expression can also be written as

$$E(D) = \sum_{j \in D} H(j) + G(D) \qquad (1)$$

where

$$H(j) = \sum_{a \in A_j} \Big\{ r_Q \alpha_j(a) f(j,a) + \big[ r_U (\nu_j \beta_j(a) + \mu_j(a)) + (r_I + r_D) \beta_j(a) \big] f'(j,a) \Big\}$$

and

$$G(D) = r_Q c N \prod_{j \in D} \Big[ 1 - p_j + \sum_{a \in A_j} \alpha_j(a) \beta_j(a) \Big].$$

The problem of finding an optimal set of indices can now be formally defined as that
of finding a smallest set $D \subseteq M = \{1, 2, \ldots, m\}$ which minimizes the above expression for E(D).
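
To make the optimization problem concrete, the following sketch evaluates E(D) on every subset of a small attribute set. It assumes that each attribute has already been reduced to a pair of numbers: H(j), and an abbreviation F(j) chosen so that the j-th bracket in G(D) equals 1 - F(j). The function names and the numerical values are invented for the illustration, and `g_empty` stands for $r_Q c N$, the value of G on the empty set.

```python
from itertools import combinations
from math import prod
from typing import FrozenSet, List, Tuple


def total_cost(D: FrozenSet[int], H: List[float], F: List[float], g_empty: float) -> float:
    """E(D) = sum_{j in D} H(j) + G(D), with G(D) = g_empty * prod_{j in D} (1 - F(j))."""
    return sum(H[j] for j in D) + g_empty * prod(1 - F[j] for j in D)


def brute_force_optimum(H: List[float], F: List[float],
                        g_empty: float) -> Tuple[FrozenSet[int], float]:
    """Evaluate E on all 2^m subsets, smallest sets first, keeping the first strict minimum."""
    m = len(H)
    best_set: FrozenSet[int] = frozenset()
    best_cost = total_cost(best_set, H, F, g_empty)
    for size in range(1, m + 1):
        for combo in combinations(range(m), size):
            candidate = frozenset(combo)
            cost = total_cost(candidate, H, F, g_empty)
            if cost < best_cost:
                best_set, best_cost = candidate, cost
    return best_set, best_cost


H = [1.0, 5.0, 1.5, 0.2]      # illustrative per-index costs H(j)
F = [0.5, 0.1, 0.5, 0.2]      # illustrative reductions: bracket_j = 1 - F(j)
g_empty = 10.0                # r_Q * c * N

best_set, best_cost = brute_force_optimum(H, F, g_empty)
print(sorted(best_set), round(best_cost, 6))
```

Scanning the subsets in order of increasing size and accepting only strict improvements returns a smallest minimizing set, which matches the way the problem is stated. The exhaustive scan is only practical for small m.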

SECTION 2

Analysis of the Cost Function

A straightforward evaluation of E(D) for all subsets of M would certainly solve the
optimization problem. We are interested in finding algorithms which take fewer than $2^m$
evaluations to obtain the optimal set. Another approach would be to construct chains of sets
$D_0 = \emptyset, D_1, \ldots$ with each $D_{i+1}$ a superset of $D_i$ obtained by adding one more element of
M, such that the values $E(D_i)$ form a nonincreasing sequence. Proceeding in this manner, we
could find a collection of locally optimal sets D (i.e., all sets D such that, for all j,
$E(D) \le E(D \cup \{j\})$ and $E(D) \le E(D - \{j\})$). In general, by following this procedure we may
not find the optimal solution among the collection of sets found. For example, assume an
optimization problem over a set {A,B,C} whose cost function E'(D) is such that
$E'(\emptyset) = 5$, $E'(\{A\}) = 6$, $E'(\{B\}) = 3$, $E'(\{C\}) = 4$, $E'(\{A,B\}) = 9$, $E'(\{A,C\}) = 7$,
$E'(\{B,C\}) = 8$, $E'(\{A,B,C\}) = 2$. The above method would produce the collection {B} and {C}
as local optimal solutions reached from $\emptyset$, thus missing the optimal set {A,B,C}.
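
The counterexample can be replayed mechanically. The sketch below stores the table E' in a dictionary and follows every chain that adds one point at a time without increasing the cost; the function names are hypothetical and the traversal order is an implementation choice, not something prescribed by the text.

```python
POINTS = "ABC"
COST = {
    frozenset(): 5, frozenset("A"): 6, frozenset("B"): 3, frozenset("C"): 4,
    frozenset("AB"): 9, frozenset("AC"): 7, frozenset("BC"): 8, frozenset("ABC"): 2,
}


def is_local_optimum(D):
    """No single addition and no single removal lowers the cost."""
    grow = all(COST[D] <= COST[D | {j}] for j in POINTS if j not in D)
    shrink = all(COST[D] <= COST[D - {j}] for j in D)
    return grow and shrink


def local_optima_reached_from_empty():
    """Follow every chain that adds one point at a time without increasing the cost."""
    found, seen, stack = set(), set(), [frozenset()]
    while stack:
        D = stack.pop()
        if D in seen:
            continue
        seen.add(D)
        if is_local_optimum(D):
            found.add(D)
        for j in POINTS:
            if j not in D and COST[D | {j}] <= COST[D]:
                stack.append(D | {j})
    return found


reached = local_optima_reached_from_empty()
global_best = min(COST, key=COST.get)
print(sorted("".join(sorted(D)) for D in reached), "".join(sorted(global_best)))
```

The set {A,B,C} is itself locally optimal, but every chain leading to it from the empty set passes through a two-element set of higher cost, so the search never arrives there.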

There is a simple condition on a cost function, which we call the regularity condition,
which suffices to guarantee that the above procedure will obtain the optimal solution among
the collection of sets which it finds. The condition states that if, while performing the
procedure, a set D is reached and there is an index $j \notin D$ which increases the cost function,
then the index j can be ignored in any subsequent search from D. Formally, we have:

Definition 1: Let E be a cost function defined on subsets of a set S of points. Let
$\Delta(D,j) = E(D \cup \{j\}) - E(D)$. Then E is said to be regular if $\Delta(D',j) \ge \Delta(D,j)$ for any point j and
sets D, D' for which $D \subseteq D'$ and $j \notin D'$.

Note that if E is regular and for some D and $j \notin D$, $\Delta(D,j) > 0$ then, for all D' with $D \subseteq D'$
and $j \notin D'$, we have $\Delta(D',j) > 0$.
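
For small point sets the regularity condition can be checked exhaustively. The sketch below is one way to do it, with hypothetical names; it is applied to the table E' of the three-point example and to a made-up cost that charges 1 per chosen point plus a term that halves with every added point.

```python
from itertools import combinations


def subsets(points):
    pts = list(points)
    return [frozenset(c) for r in range(len(pts) + 1) for c in combinations(pts, r)]


def is_regular(points, E, tol=1e-12):
    """Check Delta(D', j) >= Delta(D, j) for all D subset of D' and all j not in D'."""
    all_sets = subsets(points)
    for D in all_sets:
        for D2 in all_sets:
            if not D <= D2:
                continue
            for j in points:
                if j in D2:
                    continue
                delta_small = E(D | {j}) - E(D)
                delta_large = E(D2 | {j}) - E(D2)
                if delta_large < delta_small - tol:
                    return False
    return True


# The cost table E' of the three-point example.
TABLE = {frozenset(): 5, frozenset("A"): 6, frozenset("B"): 3, frozenset("C"): 4,
         frozenset("AB"): 9, frozenset("AC"): 7, frozenset("BC"): 8, frozenset("ABC"): 2}


def halving_cost(D):
    """Made-up cost: 1 per chosen point plus 8 / 2^|D|."""
    return len(D) + 8 / 2 ** len(D)


table_is_regular = is_regular("ABC", TABLE.__getitem__)
halving_is_regular = is_regular("ABC", halving_cost)
print(table_is_regular, halving_is_regular)
```

E' fails the test because adding A to the empty set costs 1 while adding A to {B,C} gains 6, which is exactly the situation the regularity condition rules out.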

The following lemma states that, for a regular function, the above procedure
succeeds in obtaining an optimal solution.

Lemma 1: Let E be a regular cost function and D a locally optimal set (i.e.,
$E(D \cup \{k\}) \ge E(D)$ and $E(D - \{k\}) \ge E(D)$ for all k). Then $E(D_0) \ge E(D_1) \ge \cdots \ge E(D_n)$ for all
increasing sequences of sets $D_0, D_1, \ldots, D_n = D$ satisfying $|D_i| = i$.

Proof: We have to show that for any subset D' of D, $E(D' \cup \{j\}) \le E(D')$ for all
$j \in D - D'$ or equivalently, $\Delta(D',j) \le 0$. Assume the contrary, and consider the set
$D'' = D - \{j\}$. Clearly, $D' \subseteq D''$. By assumption, $\Delta(D',j) > 0$, which implies, since E is regular,
that $\Delta(D'',j) > 0$. This contradicts the fact that D is a local optimum. ∎

Lemma 1 states that if D is a local optimum, it will be found by the procedure
described above because for any chain $\emptyset = D_0, D_1, \ldots, D_n = D$ with $|D_i| = i$ we have that
$E(D_0), E(D_1), \ldots, E(D_n)$ is a nonincreasing sequence. (We note here that an analogous proof
shows that Lemma 1 also holds if the sequence of sets is decreasing, i.e., $S = D_0, D_1, \ldots, D_p$
satisfying $|D_i| = |S| - i$, so, in particular, a search which starts from S will also find the global
optimum.)

There is a class of cost functions, which includes our particular cost function, which
are regular. They are characterized in the next definition and lemma.

Definition 2: Let K be a function which maps subsets of S to a totally ordered
domain with order relation given by $\le$. Then K is said to be monotone nonincreasing (mni)
if $D \subseteq D'$ implies $K(D') \le K(D)$.

Lemma 2: Let E be a cost function such that $\Delta(D,j) = E(D \cup \{j\}) - E(D)$ can be
written as $\Delta(D,j) = A(j) - B(D,j)$ where, for each fixed j, B(D,j) is mni. Then E is regular.

Proof: Let D', D be subsets with $D \subseteq D'$ and let j be a point $j \notin D'$. We have, by
definition, $\Delta(D',j) - \Delta(D,j) = [A(j) - B(D',j)] - [A(j) - B(D,j)] = B(D,j) - B(D',j)$. Since, for fixed
j, B(D,j) is mni, $D \subseteq D'$ implies $B(D',j) \le B(D,j)$. So, $\Delta(D',j) \ge \Delta(D,j)$ as required. ∎

Our first result shows that the cost function we are dealing with is regular.

Theorem 1: Let $E(D) = \sum_{j \in D} H(j) + G(D)$ as defined in (1). Then E is regular.

Proof: By definition, $\Delta(D,j) = H(j) + G(D \cup \{j\}) - G(D) = H(j) - F(j) G(D)$, where
$F(j) = p_j - \sum_{a \in A_j} \alpha_j(a) \beta_j(a)$. Since $\sum_{a \in A_k} \alpha_k(a) = p_k$, we have that $0 < F(k) < 1$
(assuming $p_k > 0$ and $|A_k| > 1$. This assumption is justified since $p_k = 0$ or $|A_k| = 1$ imply
$F(k) = 0$ and thus $\Delta(D,k) \ge 0$ so the k-th attribute would not be part of any optimal solution.
In the sequel, we will assume this to hold for all attributes). Every factor $1 - F(k)$ of G(D)
therefore lies between 0 and 1, so G(D), and with it $B(D,j) = F(j) G(D)$, is mni, and the
theorem follows from Lemma 2.

Other classes of functions satisfying Lemma 2 (and thus the regularity condition)
have appeared in the literature. In [3] the following cost function is studied in connection
with an optimal allocation of copies of files in an information network with n nodes:

$$H(D) = \sum_{k \in D} \sigma_k + \sum_{i=1}^{n} G_i(D)$$

where $\sigma_k$ depends only on parameters associated to the k-th node in the network and
$G_i(D) = \lambda_i \min_{k \in D} d_{ik}$, where $\lambda_i$ is a constant associated with the i-th node and $d_{ik}$
is a cost associated to the link between nodes i and k. H(D) is the cost associated to
selecting the set D of nodes as information storage nodes. A result like Lemma 1 but
specialized to this function was obtained. Since

$$\Delta(D,j) = \sigma_j + \sum_{i=1}^{n} \lambda_i \Big[ \min_{k \in D \cup \{j\}} d_{ik} - \min_{k \in D} d_{ik} \Big] = \sigma_j - \sum_{i=1}^{n} \lambda_i \Big( \min_{k \in D} d_{ik} \mathbin{\dot{-}} d_{ij} \Big)$$

where $x \mathbin{\dot{-}} y$ = if $x > y$ then $x - y$ else 0, it follows that this cost function satisfies the
conditions of Lemma 2. This implies that it is regular and Lemma 1 holds. Thus, Theorem 1
in [3] is obtained as a special case of Lemma 2.

SECTION  3 


Since our cost function is regular we know that a depth first search as described in
Section 2 will find the optimal solution. In this section we will show that we do not need to
examine all possible nodes which could be reached during an unrestricted search. Thus the
time required to find the optimal set will be reduced. This result will be obtained by
characterizing properties of the optimal solution.


Definition: Let $\{A_j\}$ be the set of attributes for our file. For each $A_j$, define
a tuple (F(j), K(j)) where F(j) is defined as in the proof of Theorem 1 and K(j) = H(j)/F(j), with
H(j) as defined in (1). Thus we get a set S of m vectors, each having two components.
Let $s_i = (F(i), K(i))$ and $s_j = (F(j), K(j))$ be two such vectors. A partial order $\le$ can be defined
as follows: $s_i \le s_j$ iff $F(i) \le F(j)$ and $K(i) \ge K(j)$. If $s_i \le s_j$ then $s_j \ge s_i$. Two points are
independent if they are not comparable under $\le$. A decreasing chain of
points in S is a sequence $s_1, s_2, \ldots, s_n$ such that $s_1 \ge s_2 \ge s_3 \ge \cdots \ge s_n$. The following
theorems characterize the set of points in an optimal solution.

Theorem 2: Let S be a set of points as above. If $s_j \in S$ belongs to an optimal
solution, then all points $s_i \ne s_j$ such that $s_i \ge s_j$ are also contained in an optimal solution.

Proof: Let $s_j \in D'$, an optimal solution. For a given pair $s_i$, $s_j$, let
$\Delta(D) = \Delta(D,i) - \Delta(D,j)$. (Here, $\Delta(D,i)$ stands for $E(D \cup \{s_i\}) - E(D)$.) Assume $s_i \notin D'$.
Consider the set $D = D' - \{s_j\}$. Since D' is an optimal solution, $\Delta(D,j) \le 0$.

Claim: It suffices to show that $\Delta(D,j) < 0 \Rightarrow \Delta(D) < 0$ and $\Delta(D,j) = 0 \Rightarrow \Delta(D) \le 0$.
This follows because, if $\Delta(D,j) < 0$ then $\Delta(D) < 0$ so that $\Delta(D,i) < \Delta(D,j)$. Since we have
assumed $s_i \notin D'$, we get $E(D \cup \{s_i\}) < E(D')$, contradicting the optimality of D'. Thus $s_i \in D'$,
an optimal set. If, on the other hand, $\Delta(D,j) = 0$ then $\Delta(D) \le 0$ and so $\Delta(D,i) \le 0$. If
$\Delta(D,i) < 0$ we get a contradiction as before, and so $s_i \in D'$, an optimal solution. Finally, if
$\Delta(D,i) = 0$, $E(D \cup \{s_i\}) = E(D')$ which means $D \cup \{s_i\}$ is also an optimal set, which again
proves the theorem.

To see why the claim holds, note that

$$\Delta(D) = F(i)[K(i) - G(D)] - F(j)[K(j) - G(D)] = [F(i) - F(j)][K(j) - G(D)] + F(i)[K(i) - K(j)].$$

Since $s_i \ge s_j$ we have $F(i) \ge F(j)$ and $K(i) \le K(j)$ (but $s_i \ne s_j$ so $F(i) > F(j)$ or $K(i) < K(j)$).
Thus, if $\Delta(D,j) < 0$ (i.e., $K(j) - G(D) < 0$), then $\Delta(D) < 0$, while if $\Delta(D,j) = 0$ (i.e.,
$K(j) = G(D)$) then $\Delta(D) < 0$ (if $K(i) < K(j)$) or $\Delta(D) = 0$ (if $K(i) = K(j)$). In any case, the
theorem is proved. ∎

Theorem 2 says that if an optimal solution contains a point s, then all points in a
decreasing chain ending in s are also part of an optimal solution. Using this fact, the search
for an optimal solution can be organized as follows: Partition S into the smallest set of
disjoint descending chains. Let w be the number of such chains. The set of candidates to
be adjoined to a current partial solution D is obtained by considering the subset of
independent points among the set of points which are maximal in each chain. Thus, at each
step of the search, at most w points have to be considered. If $m_1, m_2, \ldots, m_w$ are the numbers
of points in each chain, the maximal number of sets examined during the entire search will be
less than $(1 + m_1)(1 + m_2) \cdots (1 + m_w) \le (1 + m/w)^w$. Assuming that all points have different
components, the value of w turns out to be the length of a longest increasing sequence in a
permutation of m elements. There is no known expression for the average of this quantity
but empirical studies [1] have shown that the asymptotic behavior is $2 m^{0.5}$. (Recently,
Steckin has shown [13] that this average is bounded above by $e\, m^{0.5}$.) So, an upper bound
for the average number of sets examined, assuming all permutations to be equally likely, is

$$(1 + m^{0.5}/2)^{2 m^{0.5}} = O\big(2^{m^{0.5} \log m}\big).$$
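
The quantities in this counting argument are easy to compute for a given set of (F, K) pairs. The sketch below is an illustration with hypothetical function names and made-up data. It assumes all F values are distinct and all K values are distinct, and it relies on Dilworth's theorem, by which the minimum number of chains covering a finite partially ordered set equals the size of its largest set of pairwise independent points; here that is the longest sequence increasing in both F and K.

```python
from bisect import bisect_left
from typing import List, Tuple

Point = Tuple[float, float]   # (F(j), K(j))


def dominates(si: Point, sj: Point) -> bool:
    """s_i >= s_j in the partial order: F(i) >= F(j) and K(i) <= K(j)."""
    return si[0] >= sj[0] and si[1] <= sj[1]


def min_chain_count(points: List[Point]) -> int:
    """Length of the longest sequence of points increasing in both F and K.

    Assumes all F values are distinct and all K values are distinct.
    """
    ks = [k for _, k in sorted(points)]      # K values in order of increasing F
    tails: List[float] = []                  # tails[r]: smallest last K of an increasing run of length r + 1
    for k in ks:
        pos = bisect_left(tails, k)
        if pos == len(tails):
            tails.append(k)
        else:
            tails[pos] = k
    return len(tails)


def search_bound(m: int, w: int) -> float:
    """The bound (1 + m/w)^w on the number of sets examined."""
    return (1 + m / w) ** w


points = [(0.1, 5.0), (0.2, 3.0), (0.3, 4.0), (0.4, 1.0), (0.5, 2.0)]
w = min_chain_count(points)
print(w, search_bound(len(points), w), 2 ** len(points))
```

For these five points two chains suffice, so the bound on the number of sets examined is well below the 32 subsets that plain enumeration would visit.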

The partial order defined in the definition above has induced a precedence in the
order in which points have to be examined. Theorem 2 established that this precedence was
partial, as nothing was established for independent points. Theorem 3 will provide conditions
under which a precedence can be established for these independent pairs of points. Notice
that if a precedence could be established for all independent pairs, then a total precedence
would exist and a linear scan would determine the optimal set.

Theorem 3: Let i, j be two independent points in S such that $F(i) < F(j)$ and
$K(i) < K(j)$. If $\Delta(D,i) < \Delta(D,j)$ for some D then $\Delta(D',i) < \Delta(D',j)$ for any superset D' of D.

Proof: As in the proof of Theorem 2, we have for $D' \supseteq D$,

$$\Delta(D') - \Delta(D) = [F(i) - F(j)][G(D) - G(D')] \le 0,$$

since G(D) is mni. Thus $\Delta(D') \le \Delta(D) < 0$ as was to be shown. ∎
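
A small numerical check may help to see what Theorem 3 asserts. The sketch uses the form $\Delta(D,i) = F(i)[K(i) - G(D)]$ from the proof of Theorem 2 and treats G(D) simply as a number that can only shrink as D grows; the two points and the sequence of G values are made up for the illustration.

```python
def delta(point, g):
    """Delta(D, i) = F(i) * (K(i) - G(D)), written in terms of g = G(D)."""
    F, K = point
    return F * (K - g)


s_i = (0.2, 1.0)     # smaller F and smaller K
s_j = (0.6, 2.0)     # larger F and larger K, so the two points are independent

# G(D) can only shrink as D grows, so a search sees a decreasing sequence of values.
g_values = [10.0, 3.0, 2.2, 1.5, 0.5]
prefers_i = [delta(s_i, g) < delta(s_j, g) for g in g_values]
print(prefers_i)
```

While G(D) is large the point with the larger F is the better choice; once the point with the smaller F becomes preferable it stays preferable for every smaller value of G, which is the persistence the theorem states.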

Theorem 3 says that if, while performing the depth first search procedure,
two points in an independent set can be chosen to be included in a set D and the
one with the smaller value of the F function is preferable to the other (i.e., it decreases
the value of the cost function more), it remains preferable at any later stage of the search
which extends the current set D. Thus, at some point during a partial search we may discover a
precedence between two points in an independent set. Using these results we may give the
following informal description of an algorithm to find the optimal set of indices from a set S
specified by tuples (F(i), K(i)). The algorithm keeps track of the precedence that exists
between points.

1. Initialization ($D_0$ is the current choice for global optimum, opt is the lowest value
of $E(D_0)$ obtained so far, R is a pushdown stack whose entries are pairs $(D_i, P_i)$, where
$D_i \subseteq S$ and $P_i$ is a directed graph with at most |S| points): $D_0 \leftarrow \emptyset$, opt $\leftarrow \infty$. Define an
initial directed graph $P_{init}$ as follows: Nodes are all points $i \in S$ such that $\Delta(\emptyset,i) < 0$.
(Points i with $\Delta(\emptyset,i) \ge 0$ are never included in an optimal solution so they need not be
considered.) Node i is directed to j if either $F(i) \ge F(j)$ and $K(i) \le K(j)$ (thus $s_i \ge s_j$ and
Theorem 2 applies) or $F(i) < F(j)$, $K(i) < K(j)$ and $\Delta(\emptyset,i) < \Delta(\emptyset,j)$ (by Theorem 3).
Let $R \leftarrow (\emptyset, P_{init})$.

2. While R $\ne$ empty do
   begin (D,P) $\leftarrow$ R; (pop the stack)
      Delete all nodes i in P such that $\Delta(D,i) \ge 0$ or $i \in D$;
      If P = $\emptyset$ and $\forall j \in D$ ($\Delta(D - \{j\}, j) \le 0$) (i.e., found local optimal)
         then begin
            if opt > E(D)
               then (local optimal found is best so far) begin opt $\leftarrow$ E(D), $D_0 \leftarrow D$ end
         end
         else begin (let Source(P) be the set of all
            nodes in P having no ingoing directed
            edges. Note that Source(P) is an independent
            set. Let P' be the graph obtained by augmenting
            P by joining $i \in$ Source(P) to $j \in$ Source(P)
            whenever $K(i) < K(j)$ and $\Delta(D,i) < \Delta(D,j)$.)
            For each $i \in$ Source(P'), let R $\leftarrow$ ($D \cup \{i\}$, P'); (push the stack)
         end;
   end;
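
The informal description can be turned into a short program. The following Python sketch is one possible reading of it, not a transcription of the author's implementation: each attribute is given directly as a pair (F(j), K(j)), the pairs are assumed to be pairwise distinct, `g_empty` stands for $r_Q c N$, the graph P is kept as a node set plus an edge set, and all names and numbers are invented for the example.

```python
from math import inf, prod
from typing import FrozenSet, List, Set, Tuple

Point = Tuple[float, float]          # (F(j), K(j)); H(j) = F(j) * K(j)
Edge = Tuple[int, int]               # (i, j): i is examined before j


def G(points: List[Point], g_empty: float, D: FrozenSet[int]) -> float:
    return g_empty * prod(1 - points[j][0] for j in D)


def cost(points: List[Point], g_empty: float, D: FrozenSet[int]) -> float:
    """E(D) = sum_{j in D} F(j) K(j) + G(D)."""
    return sum(points[j][0] * points[j][1] for j in D) + G(points, g_empty, D)


def optimal_index_set(points: List[Point], g_empty: float) -> Tuple[FrozenSet[int], float]:
    def delta(D: FrozenSet[int], i: int) -> float:
        F, K = points[i]
        return F * (K - G(points, g_empty, D))

    empty: FrozenSet[int] = frozenset()
    # Step 1: initial nodes and precedence edges.
    nodes = {i for i in range(len(points)) if delta(empty, i) < 0}
    edges: Set[Edge] = set()
    for i in nodes:
        for j in nodes:
            if i == j:
                continue
            (Fi, Ki), (Fj, Kj) = points[i], points[j]
            by_theorem_2 = Fi >= Fj and Ki <= Kj
            by_theorem_3 = Fi < Fj and Ki < Kj and delta(empty, i) < delta(empty, j)
            if by_theorem_2 or by_theorem_3:
                edges.add((i, j))

    best_set, best = empty, inf
    stack = [(empty, nodes, edges)]
    # Step 2: depth first search driven by the stack R.
    while stack:
        D, nodes, edges = stack.pop()
        nodes = {i for i in nodes if i not in D and delta(D, i) < 0}
        edges = {(i, j) for (i, j) in edges if i in nodes and j in nodes}
        if not nodes:
            locally_optimal = all(delta(D - {j}, j) <= 0 for j in D)
            if locally_optimal and cost(points, g_empty, D) < best:
                best_set, best = D, cost(points, g_empty, D)
            continue
        sources = nodes - {j for (_, j) in edges}
        # Augment P with the precedences that Theorem 3 yields at the current D.
        new_edges = set(edges)
        for i in sources:
            for j in sources:
                if points[i][1] < points[j][1] and delta(D, i) < delta(D, j):
                    new_edges.add((i, j))
        for i in nodes - {j for (_, j) in new_edges}:
            stack.append((D | {i}, nodes, new_edges))
    return best_set, best


points = [(0.5, 2.0), (0.1, 50.0), (0.5, 3.0), (0.2, 1.0)]   # made-up (F, K) pairs
best_set, best = optimal_index_set(points, g_empty=10.0)
print(sorted(best_set), round(best, 6))
```

In this example attribute 1 is discarded at initialization because indexing it alone already raises the cost, and the search then reaches the three remaining attributes along chains that respect the recorded precedences. For inputs this small the result can be compared against an exhaustive evaluation of `cost` over all subsets.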

As was mentioned above, an upper bound on the asymptotic average running time of
a deterministic version of this algorithm is $O\big(2^{m^{0.5} \log m}\big)$, which is a big improvement over
the $2^m$ obtained by enumeration. Empirical studies with it have shown that even this reduced
upper bound is still much higher than the actual number of nodes visited.

CONCLUSIONS 

The problem of index optimization has been solved under very general assumptions,
and properties of the optimal solution have been found which allow the existence of an
efficient algorithm to determine the solution. It is easy to see that previously reported
methods for solving this problem ([7], [12], [14]) are special cases of the results shown
here.